Research Question¶
How does aneuploidy impact the identity and functional state of early human embryonic cells?
Dataset¶
Title: Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos*
Citation: Petropoulos et al., 2016
DOI: 10.1016/j.cell.2016.03.023
Accession: E-MTAB-3929 (ArrayExpress)
Key Features¶
| Attribute | Description |
|---|---|
| Organism | Homo sapiens (Human) |
| Assay Type | Single-cell RNA sequencing (scRNA-seq) |
| Developmental Stage | Embryonic days E4–E7, spanning 8-cell to blastocyst stages |
| Lineages Covered | Trophectoderm (TE), Epiblast (EPI), Primitive Endoderm (PrE) |
Author¶
William Mwine
Data Exploration¶
Here we have an initial look at the counts matrix.
import scanpy as sc
import pandas as pd
import numpy as np
counts = pd.read_csv("data/counts.txt", sep="\t")
counts
| Unnamed: 0 | E5.5.101 | E5.5.100 | E6.2.114 | E6.2.104 | E6.2.107 | E6.2.116 | E7.2.138 | E6.2.118 | E6.2.105 | ... | E3.50.3415 | E3.51.3421 | E3.53.3437 | E3.51.3423 | E3.52.3429 | E3.49.3407 | E3.51.3426 | E3.47.3391 | E3.52.3431 | E3.53.3438 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1BG | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | ... | 327 | 2167 | 170 | 451 | 104 | 446 | 2517 | 473 | 104 | 116 |
| 1 | A1BG-AS1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 88 | 0 | 0 | 0 | 0 | 1 | 7 | 0 | 0 |
| 2 | A1CF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 39 | 86 | 0 | 0 | 15 | 11 | 94 | 40 | 66 | 34 |
| 3 | A2M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | A2M-AS1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26173 | ZYG11A | 5 | 16 | 22 | 2 | 1 | 48 | 0 | 3 | 111 | ... | 90 | 515 | 70 | 295 | 34 | 219 | 221 | 97 | 25 | 188 |
| 26174 | ZYG11B | 0 | 1 | 19 | 1 | 0 | 8 | 0 | 23 | 0 | ... | 0 | 133 | 0 | 52 | 0 | 0 | 0 | 98 | 0 | 0 |
| 26175 | ZYX | 17 | 17 | 9 | 16 | 26 | 88 | 2301 | 1 | 0 | ... | 17 | 492 | 39 | 87 | 38 | 71 | 185 | 14 | 4 | 160 |
| 26176 | ZZEF1 | 0 | 1 | 70 | 8 | 23 | 5 | 240 | 16 | 0 | ... | 50 | 120 | 86 | 1 | 199 | 0 | 65 | 0 | 136 | 144 |
| 26177 | ZZZ3 | 12 | 7 | 110 | 4 | 1 | 12 | 306 | 36 | 326 | ... | 261 | 791 | 146 | 1705 | 295 | 996 | 740 | 986 | 318 | 0 |
26178 rows × 1530 columns
counts.rename(columns={"Unnamed: 0": "gene_names"}, inplace=True)
counts = counts.set_index("gene_names")
counts
| E5.5.101 | E5.5.100 | E6.2.114 | E6.2.104 | E6.2.107 | E6.2.116 | E7.2.138 | E6.2.118 | E6.2.105 | E7.2.144 | ... | E3.50.3415 | E3.51.3421 | E3.53.3437 | E3.51.3423 | E3.52.3429 | E3.49.3407 | E3.51.3426 | E3.47.3391 | E3.52.3431 | E3.53.3438 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene_names | |||||||||||||||||||||
| A1BG | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | ... | 327 | 2167 | 170 | 451 | 104 | 446 | 2517 | 473 | 104 | 116 |
| A1BG-AS1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 88 | 0 | 0 | 0 | 0 | 1 | 7 | 0 | 0 |
| A1CF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 39 | 86 | 0 | 0 | 15 | 11 | 94 | 40 | 66 | 34 |
| A2M | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2M-AS1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZYG11A | 5 | 16 | 22 | 2 | 1 | 48 | 0 | 3 | 111 | 10 | ... | 90 | 515 | 70 | 295 | 34 | 219 | 221 | 97 | 25 | 188 |
| ZYG11B | 0 | 1 | 19 | 1 | 0 | 8 | 0 | 23 | 0 | 13 | ... | 0 | 133 | 0 | 52 | 0 | 0 | 0 | 98 | 0 | 0 |
| ZYX | 17 | 17 | 9 | 16 | 26 | 88 | 2301 | 1 | 0 | 904 | ... | 17 | 492 | 39 | 87 | 38 | 71 | 185 | 14 | 4 | 160 |
| ZZEF1 | 0 | 1 | 70 | 8 | 23 | 5 | 240 | 16 | 0 | 0 | ... | 50 | 120 | 86 | 1 | 199 | 0 | 65 | 0 | 136 | 144 |
| ZZZ3 | 12 | 7 | 110 | 4 | 1 | 12 | 306 | 36 | 326 | 61 | ... | 261 | 791 | 146 | 1705 | 295 | 996 | 740 | 986 | 318 | 0 |
26178 rows × 1529 columns
adata = sc.AnnData(X=counts.T)
adata.var_names_make_unique()
adata.var_names
Index(['A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2MP1',
'A3GALT2', 'A4GALT', 'A4GNT',
...
'ZWILCH', 'ZWINT', 'ZXDA', 'ZXDB', 'ZXDC', 'ZYG11A', 'ZYG11B', 'ZYX',
'ZZEF1', 'ZZZ3'],
dtype='object', name='gene_names', length=26178)
adata.obs_names
Index(['E5.5.101', 'E5.5.100', 'E6.2.114', 'E6.2.104', 'E6.2.107', 'E6.2.116',
'E7.2.138', 'E6.2.118', 'E6.2.105', 'E7.2.144',
...
'E3.50.3415', 'E3.51.3421', 'E3.53.3437', 'E3.51.3423', 'E3.52.3429',
'E3.49.3407', 'E3.51.3426', 'E3.47.3391', 'E3.52.3431', 'E3.53.3438'],
dtype='object', length=1529)
adata.X
array([[ 0, 0, 0, ..., 17, 0, 12],
[ 0, 0, 0, ..., 17, 1, 7],
[ 0, 0, 0, ..., 9, 70, 110],
...,
[473, 7, 40, ..., 14, 0, 986],
[104, 0, 66, ..., 4, 136, 318],
[116, 0, 34, ..., 160, 144, 0]], shape=(1529, 26178))
sc.pl.highest_expr_genes(adata, n_top=20)
QC and processing¶
In this step we;
- Doublet detection. Doublets can affect downstream analysis so it essential to detecet and remove them
- Identify Mitochondrial and Ribosomal genes. We use these as indicators of cell quality
- We calculate QC metrics that we use to filter low quality cells
%matplotlib inline
import scrublet as scr
from scipy.stats import median_abs_deviation
from biomart import BiomartServer
import time
import io
Doulet Detection¶
Only 1 cell detected as a doublet. Before filtering, we want to look at other metrics
scrub = scr.Scrublet(adata.X)
doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2,
min_cells=3,
min_gene_variability_pctl=85,
n_prin_comps=30)
Preprocessing... Simulating doublets... Embedding transcriptomes using PCA... Calculating doublet scores... Automatically set threshold at doublet score = 0.56 Detected doublet rate = 0.1% Estimated detectable doublet fraction = 12.1% Overall doublet rate: Expected = 10.0% Estimated = 0.5% Elapsed time: 3.0 seconds
scrub.plot_histogram();
adata.obs['predicted_doublets'] = predicted_doublets
adata.obs['doublet_scores'] = doublet_scores
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.obs[adata.obs['predicted_doublets']]
| predicted_doublets | doublet_scores | n_genes | |
|---|---|---|---|
| E7.9.574 | True | 0.772152 | 9128 |
Annotation of Mitochondrial and Ribosomal Genes¶
- A small percentage of genes are ribosomal.
- No mitochondrial genes were detected.
Quality Control (QC) and Outlier Removal
- The QC plots appear mostly satisfactory.
- To remove outliers, we apply the Median Absolute Deviation (MAD) method. Outlier Filtering Method: For each QC metric M, we compute its MAD and exclude data points that fall outside the range: [Median(M) − nmads ⋅ MAD, Median(M) + nmads ⋅ MAD] Where:
Mis the metric of interestnmads = 5
adata.var['MT'] = adata.var_names.str.startswith('MT-')
adata.var['RIBO'] = adata.var_names.str.startswith(("MRPS", "MRPL"))
adata.var[(adata.var['MT']) | (adata.var['RIBO'])]
| n_cells | MT | RIBO | |
|---|---|---|---|
| gene_names | |||
| MRPL1 | 1464 | False | True |
| MRPL10 | 1486 | False | True |
| MRPL11 | 1517 | False | True |
| MRPL12 | 1525 | False | True |
| MRPL13 | 1521 | False | True |
| ... | ... | ... | ... |
| MRPS36 | 1489 | False | True |
| MRPS5 | 1522 | False | True |
| MRPS6 | 1482 | False | True |
| MRPS7 | 1529 | False | True |
| MRPS9 | 1507 | False | True |
81 rows × 3 columns
sc.pp.calculate_qc_metrics(adata, qc_vars=['MT', 'RIBO'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_MT', 'pct_counts_RIBO'],
jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_RIBO')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
def is_outlier(adata, metric: str, nmads: int):
M = adata.obs[metric]
outlier = (M < np.median(M) - nmads * median_abs_deviation(M)) | (
np.median(M) + nmads * median_abs_deviation(M) < M
)
return outlier
adata
AnnData object with n_obs × n_vars = 1529 × 23633
obs: 'predicted_doublets', 'doublet_scores', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_MT', 'pct_counts_MT', 'total_counts_RIBO', 'pct_counts_RIBO'
var: 'n_cells', 'MT', 'RIBO', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
adata.obs["outlier"] = (
is_outlier(adata, "total_counts", 5)
| is_outlier(adata, "n_genes_by_counts", 5)
)
adata.obs.outlier.value_counts()
outlier False 1499 True 30 Name: count, dtype: int64
print(f"Total number of cells: {adata.n_obs}")
adata = adata[(~adata.obs.outlier)].copy()
print(f"Number of cells after filtering of low quality cells: {adata.n_obs}")
Total number of cells: 1529 Number of cells after filtering of low quality cells: 1499
adata = adata[adata.obs.pct_counts_MT < 5, :].copy()
adata = adata[adata.obs.pct_counts_RIBO < 5, :].copy()
sc.pl.scatter(adata, x='total_counts', y='pct_counts_RIBO')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_RIBO'],
jitter=0.4, multi_panel=True)
Normalize¶
We normalize our data so its comparable between cells
adata.raw = adata
# normalize
sc.pp.normalize_total(adata, target_sum=None, inplace=True)
adata.layers["normalized"] = adata.X.copy()
# Log-transform the normalized data
sc.pp.log1p(adata)
adata.layers["log1p"] = adata.X.copy()
Clustering¶
In this section we:
- indentify highly variable genes (HVG) that we can use in dimensionality reduction and clustering where we can identtify cell types and lineages in our data. The HVG are spread out across mean expressions of genes (This is a good sign)
- Carry out pca and use this during clustering
- We also look at how embryo lineage marker genes are expressed across the clusters
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000, inplace=True)
sc.pl.highly_variable_genes(adata)
sc.pp.pca(adata, n_comps=50, use_highly_variable=True)
sc.pl.pca_loadings(adata, include_lowest=False)
sc.pl.pca(adata, color=['KRT19', 'ALPL', 'IL6R'])
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
sc.pp.neighbors(adata, n_pcs=37)
sc.tl.umap(adata)
resolutions = [0.1, 0.3, 0.6, 1, 1.5, 2, 2.5, 3]
for res in resolutions:
sc.tl.leiden(adata, resolution=res, key_added='leiden_' + str(res), flavor="igraph" )
adata
AnnData object with n_obs × n_vars = 1499 × 23633
obs: 'predicted_doublets', 'doublet_scores', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_MT', 'pct_counts_MT', 'total_counts_RIBO', 'pct_counts_RIBO', 'outlier', 'leiden_0.1', 'leiden_0.3', 'leiden_0.6', 'leiden_1', 'leiden_1.5', 'leiden_2', 'leiden_2.5', 'leiden_3'
var: 'n_cells', 'MT', 'RIBO', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden_0.1', 'leiden_0.3', 'leiden_0.6', 'leiden_1', 'leiden_1.5', 'leiden_2', 'leiden_2.5', 'leiden_3'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
layers: 'normalized', 'log1p'
obsp: 'distances', 'connectivities'
sc.pl.umap(adata, color=['leiden_0.1', 'GATA3', 'SOX2', 'NANOG', 'CDX2'], ncols=2)
sc.tl.rank_genes_groups(adata, 'leiden_0.1', method='wilcoxon', use_raw=False, key_added='rank_leiden_0.1')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key='rank_leiden_0.1', ncols=3)
num_clusters = adata.obs['leiden_0.1'].nunique()
cluster_markers = {}
for cluster in range(num_clusters):
cluster_markers[f"{cluster}"] = list(sc.get.rank_genes_groups_df(adata, str(cluster), key='rank_leiden_0.1').head(10).names)
Querying additional info¶
Here we just query additional genomic information that we can use for karyotype inference
server = BiomartServer("http://www.ensembl.org/biomart")
dataset = server.datasets['hsapiens_gene_ensembl']
gene_names = list(adata.var_names)
def query_genes_by_name(gene_names, batch_size=300):
results = []
for i in range(0, len(gene_names), batch_size):
batch = gene_names[i:i+batch_size]
try:
response = dataset.search({
'filters': {'external_gene_name': batch},
'attributes': ['ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position']
})
df = pd.read_csv(io.StringIO(response.text), sep="\t", header=None)
df.columns = ['ensembl_gene_id', 'gene_name', 'chromosome', 'start', 'end']
results.append(df)
time.sleep(0.2) # avoid rate-limiting
except Exception as e:
print(f"Batch {i // batch_size + 1} failed: {e}")
return pd.concat(results, ignore_index=True)
# usage:
gene_table = query_genes_by_name(gene_names)
Batch 36 failed: No columns to parse from file Batch 37 failed: No columns to parse from file Batch 38 failed: No columns to parse from file
gene_table
| ensembl_gene_id | gene_name | chromosome | start | end | |
|---|---|---|---|---|---|
| 0 | ENSG00000291368 | ACR | HG1311_HG2539_PATCH | 74316 | 81453 |
| 1 | ENSG00000291420 | ACTG1 | HG1369_PATCH | 165618 | 180052 |
| 2 | ENSG00000154734 | ADAMTS1 | 21 | 26835755 | 26845409 |
| 3 | ENSG00000139826 | ABHD13 | 13 | 108218392 | 108234243 |
| 4 | ENSG00000154930 | ACSS1 | 20 | 25006230 | 25058980 |
| ... | ... | ... | ... | ... | ... |
| 23225 | ENSG00000227124 | ZNF717 | 3 | 75678660 | 75785583 |
| 23226 | ENSG00000178917 | ZNF852 | 3 | 44491766 | 44510636 |
| 23227 | ENSG00000203995 | ZYG11A | 1 | 52842511 | 52894998 |
| 23228 | ENSG00000235079 | ZRANB2-AS1 | 1 | 71047497 | 71067483 |
| 23229 | ENSG00000116996 | ZP4 | 1 | 237877864 | 237890922 |
23230 rows × 5 columns
gene_table = gene_table.set_index('gene_name')
gene_table
| ensembl_gene_id | chromosome | start | end | |
|---|---|---|---|---|
| gene_name | ||||
| ACR | ENSG00000291368 | HG1311_HG2539_PATCH | 74316 | 81453 |
| ACTG1 | ENSG00000291420 | HG1369_PATCH | 165618 | 180052 |
| ADAMTS1 | ENSG00000154734 | 21 | 26835755 | 26845409 |
| ABHD13 | ENSG00000139826 | 13 | 108218392 | 108234243 |
| ACSS1 | ENSG00000154930 | 20 | 25006230 | 25058980 |
| ... | ... | ... | ... | ... |
| ZNF717 | ENSG00000227124 | 3 | 75678660 | 75785583 |
| ZNF852 | ENSG00000178917 | 3 | 44491766 | 44510636 |
| ZYG11A | ENSG00000203995 | 1 | 52842511 | 52894998 |
| ZRANB2-AS1 | ENSG00000235079 | 1 | 71047497 | 71067483 |
| ZP4 | ENSG00000116996 | 1 | 237877864 | 237890922 |
23230 rows × 4 columns
gene_table_unique = gene_table[~gene_table.index.duplicated(keep='first')]
gene_table_unique
| ensembl_gene_id | chromosome | start | end | |
|---|---|---|---|---|
| gene_name | ||||
| ACR | ENSG00000291368 | HG1311_HG2539_PATCH | 74316 | 81453 |
| ACTG1 | ENSG00000291420 | HG1369_PATCH | 165618 | 180052 |
| ADAMTS1 | ENSG00000154734 | 21 | 26835755 | 26845409 |
| ABHD13 | ENSG00000139826 | 13 | 108218392 | 108234243 |
| ACSS1 | ENSG00000154930 | 20 | 25006230 | 25058980 |
| ... | ... | ... | ... | ... |
| ZNF830 | ENSG00000198783 | 17 | 34961540 | 34963777 |
| ZNF717 | ENSG00000227124 | 3 | 75678660 | 75785583 |
| ZYG11A | ENSG00000203995 | 1 | 52842511 | 52894998 |
| ZRANB2-AS1 | ENSG00000235079 | 1 | 71047497 | 71067483 |
| ZP4 | ENSG00000116996 | 1 | 237877864 | 237890922 |
20139 rows × 4 columns
adata.var = pd.merge(adata.var, gene_table_unique, left_index=True, right_index=True, how='left')
standard_chroms = [str(i) for i in range(1, 23)] + ["X", "Y"]
adata.var['chromosome'] = adata.var['chromosome'].apply(
lambda x: f'chr{x}' if str(x) in standard_chroms else x
)
adata.var
| n_cells | MT | RIBO | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | highly_variable | means | dispersions | dispersions_norm | ensembl_gene_id | chromosome | start | end | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gene_names | |||||||||||||||
| A1BG | 571 | False | False | 571 | 32.089601 | 62.655330 | 49065 | True | 3.459785 | 6.432617 | 2.748121 | ENSG00000121410 | chr19 | 58345178.0 | 58353492.0 |
| A1BG-AS1 | 194 | False | False | 194 | 1.126880 | 87.311969 | 1723 | False | 0.628735 | 2.736047 | -1.337072 | ENSG00000268895 | chr19 | 58347718.0 | 58355455.0 |
| A1CF | 151 | False | False | 151 | 2.355788 | 90.124264 | 3602 | False | 1.242799 | 4.065357 | 0.062150 | ENSG00000148584 | chr10 | 50799409.0 | 50885675.0 |
| A2M | 133 | False | False | 133 | 1.567691 | 91.301504 | 2397 | False | 0.894380 | 4.125932 | 0.457751 | ENSG00000175899 | chr12 | 9067664.0 | 9116229.0 |
| A2M-AS1 | 57 | False | False | 57 | 0.431001 | 96.272073 | 659 | False | 0.372899 | 3.480338 | 0.711840 | ENSG00000245105 | chr12 | 9065163.0 | 9068689.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZYG11A | 1316 | False | False | 1316 | 107.223676 | 13.930674 | 163945 | False | 4.571177 | 5.136084 | 0.431149 | ENSG00000203995 | chr1 | 52842511.0 | 52894998.0 |
| ZYG11B | 1137 | False | False | 1137 | 46.550033 | 25.637672 | 71175 | False | 3.711862 | 4.281110 | -0.598707 | ENSG00000162378 | chr1 | 52726453.0 | 52827336.0 |
| ZYX | 1479 | False | False | 1479 | 251.349248 | 3.270111 | 384313 | False | 5.375675 | 5.315331 | 0.537163 | ENSG00000285443 | HG708_PATCH | 659.0 | 10475.0 |
| ZZEF1 | 1292 | False | False | 1292 | 77.043165 | 15.500327 | 117799 | False | 4.287999 | 4.669287 | -0.139401 | ENSG00000074755 | chr17 | 4004445.0 | 4143030.0 |
| ZZZ3 | 1454 | False | False | 1454 | 188.452583 | 4.905167 | 288144 | False | 5.108596 | 5.221984 | 0.388034 | ENSG00000036549 | chr1 | 77562416.0 | 77683419.0 |
23633 rows × 15 columns
Karyotype Inference¶
In this section we use infercnv to infer large-scale CNV (deletions or gains)
import infercnvpy as cnv
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
cnv.tl.infercnv(adata)
In the heat map, the color Scale represents the inferred relative expression level of genes.
- Dark Blue Indicates lower than average gene expression for a given genomic region. This can be interpreted as a putative copy number deletion (loss).
- White: Indicates average or baseline gene expression, suggesting a normal diploid state for that genomic region.
- Red: Indicates higher than average gene expression for a given genomic region. This can interpreted as a putative copy number amplification (gain).
In our plot we see siginificant number of deletions and gains across most chromosomes for cluster 5
cnv.pl.chromosome_heatmap(adata, groupby="leiden_0.1")
We calculate CNV scores for each cell, defined as the mean of the absolute values of the inferred CNV signal across all genomic regions (genes).
To identify aneuploid vs. euploid cells, we use the 90th percentile of the CNV score distribution as a threshold:
- cells with a CNV score greater than or equal to this threshold are classified as aneuploid, and the rest as euploid
cnv.tl.pca(adata)
cnv.pp.neighbors(adata,)
cnv.tl.leiden(adata, flavor="igraph")
cnv.tl.cnv_score(adata)
adata
AnnData object with n_obs × n_vars = 1499 × 23633
obs: 'predicted_doublets', 'doublet_scores', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_MT', 'pct_counts_MT', 'total_counts_RIBO', 'pct_counts_RIBO', 'outlier', 'leiden_0.1', 'leiden_0.3', 'leiden_0.6', 'leiden_1', 'leiden_1.5', 'leiden_2', 'leiden_2.5', 'leiden_3', 'cnv_leiden', 'cnv_score'
var: 'n_cells', 'MT', 'RIBO', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'ensembl_gene_id', 'chromosome', 'start', 'end'
uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden_0.1', 'leiden_0.3', 'leiden_0.6', 'leiden_1', 'leiden_1.5', 'leiden_2', 'leiden_2.5', 'leiden_3', 'leiden_0.1_colors', 'rank_leiden_0.1', 'cnv', 'cnv_neighbors', 'cnv_leiden'
obsm: 'X_pca', 'X_umap', 'X_cnv', 'X_cnv_pca'
varm: 'PCs'
layers: 'normalized', 'log1p'
obsp: 'distances', 'connectivities', 'cnv_neighbors_distances', 'cnv_neighbors_connectivities'
sc.pl.umap(adata, color=['leiden_0.1', 'cnv_score'])
adata.obs["cnv_score"].describe()
count 1499.000000 mean 0.040579 std 0.022114 min 0.015034 25% 0.023933 50% 0.037499 75% 0.050227 max 0.104277 Name: cnv_score, dtype: float64
sns.histplot(adata.obs["cnv_score"], bins=100, kde=True);
plt.title("CNV Score Distribution");
threshold = adata.obs["cnv_score"].quantile(0.9).item()
adata.obs["is_aneuploid"] = adata.obs["cnv_score"] >= threshold
Around 10% of the cells are annotated as aneuploids and they are found mostly in cluster 5 and 3
sc.pl.umap(adata, color=["is_aneuploid","leiden_0.1", "cnv_score"], ncols=2, wspace=0.2)
adata.obs['is_aneuploid'].value_counts()
is_aneuploid False 1340 True 159 Name: count, dtype: int64
Annotating cell types¶
In this section, we annotate clusters to determine cell types. In embryos (E3-E7), we would normally expect to have the following cell lineages:
- Trophectoderm and Inner Cell Mass (ICM) -> (first differentiation)
- The TE gives rise to the placenta and the ICM differentiates into EPI and PrE
- Epiblast (EPI) and Primitive Endoderm(PrE)
- The pluripotent EPI differentiates into the germ layers and the PrE differentiates into extraembryonic membranes
We define marker genes for each cell type from literature and try to annotate the identified cell clusters. We also overlay apoptosis and proliferation markers
marker_genes = {
'Trophectoderm(TE)' : ['GATA2','GATA3', 'TEAD3', 'KRT18'],
'Inner Cell Mass (ICM)' : ['PRSS3', 'NANOGNB', 'CYP26A1', 'MFN1'],
'Epiblast (EPI)': ['NANOG', 'SOX2', 'KLF17', 'TDGF1'],
'Primitive Endoderm': ['SOX17', 'PDGFRA', 'GATA6', 'GATA4'],
'apoptotic': ['BCL2', 'BAD', 'FAS', 'BAK1'],
'proliferating': ['MKI67','PCNA', 'CDK6', 'CDK1', 'CDC25A'],
'stress markers': ['HSF1', 'TP53', 'CCL2', 'SOD2']
}
Based on expression of lineage markers, clusters were annotated into major embryonic lineages. However, lineage-specific marker expression was often overlapping, reflecting the transcriptional plasticity of early-stage cells. Alot of the marker gene expression was ambigious.
Marker analysis enabled annotation of embryonic lineages (TE, EPI, PE). Marker gene expression patterns suggested widespread lineage plasticity. Cell cycle-related genes were highly expressed across all clusters, consistent with rapid proliferation.
Cluster 3, annotated as ICM/EPI, showed relatively higher expression of apoptotic markers compared to other clusters, suggesting increased cellular stress or potential susceptibility to apoptosis within this population.
sc.pl.dotplot(adata, marker_genes, groupby='leiden_0.1', use_raw=False, layer="log1p")
sc.pl.matrixplot(adata, marker_genes, groupby='leiden_0.1', use_raw=False)
cluster_to_celltype = {
'0': 'TE',
'1': 'TE',
'2': 'TE',
'3':'ICM/EPI',
'4':'TE',
'5':'EPI',
'6':'TE',
'7':'TE/ICM',
}
sc.pl.rank_genes_groups_dotplot(
adata, groupby="leiden_0.1", standard_scale="var", n_genes=5, key="rank_leiden_0.1"
)
adata.obs['cell_type'] = adata.obs['leiden_0.1'].map(cluster_to_celltype)
Cluster 5, annotated as epiblast, were completely aneuploid. Some cells in cluster 3, annotated as epiblast/inner cell mass wwere also aneuploid.
The epiblast is the cell lineage that forms the germ layers of the fetus, and its viability is highly dependent on having the correct number of chromosomes. Cells in cluster 5 have a high potential for developmental failure and should ideally be expressing apoptosis markers.
Similarly, the presence of aneuploidy in some cells of Cluster 3 (epiblast/inner cell mass) indicates that chromosomal instability may arise very early in development, even before the formation of the distinct epiblast lineage.
sc.pl.umap(adata, color=["is_aneuploid","cell_type", "BAK1", "leiden_0.1"], ncols=2, wspace=0.2, legend_loc="on data")
sc.tl.paga(adata, groups="leiden_0.1")
Here we look at the PAGA (Partition-based graph abstraction) graph which can be interpretated as developmental trajectory or relationship between the clusters. The nodes are the clusters and the thickness of an edge indicates a stronger connection, suggesting a continuous developmental trajectory or close relationship between the clusters.
The left plot clearly separates the trophoblast lineage from the inner cell mass and epiblast lineages, establishing the correct developmental relationships between the major cell lineages.
The right plot maps the karyotype status onto this very same developmental landscape. Aneuploid cells are located in the ICM and epiblast lineages, with more aneuploid cells in the epiblast lineage, indicating a progression of increased aneuploidy with further differentiation.
cluster_to_celltype_paga = {0: 'TE', 1: 'TE', 2: 'TE', 3:'ICM/EPI', 4:'TE', 5:'EPI', 6:'TE',7:'TE/ICM'}
sc.pl.paga(adata, color=['cell_type', 'is_aneuploid'],
threshold=0.03, labels=cluster_to_celltype_paga)
Enrichment analysis¶
In this section, we perform differential gene expression (DE) analysis between karyotype groups (euploid vs. aneuploid). The goal is to identify genes and biological pathways associated with chromosomal abnormalities.
from gprofiler import GProfiler
import matplotlib.pyplot as plt
import seaborn as sns
num_clusters = adata.obs['leiden_0.1'].nunique()
cluster_markers = {}
for cluster in range(num_clusters):
sc.get.rank_genes_groups_df(adata, str(cluster), key='rank_leiden_0.1').head(100)[['names']].to_csv(f"{cluster}_specific_genes.txt", header=False, index=False)
adata.obs['karyotype'] = ['aneuploid' if x else "euploid" for x in adata.obs['is_aneuploid']]
sc.tl.rank_genes_groups(adata, groupby="karyotype", method='wilcoxon', use_raw=False,
key_added='rank_euploid', reference="euploid")
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False, key='rank_euploid', ncols=2)
df_ranked = sc.get.rank_genes_groups_df(adata, group="aneuploid", key="rank_euploid")
df_ranked.head(200)[['names']].to_csv("aneuploid_DEG.txt", header=False, index=False)
df_ranked_up = df_ranked[(df_ranked['logfoldchanges'] > 0) & (df_ranked['pvals_adj'] < 0.05)] \
.sort_values(by="logfoldchanges", ascending=False)
df_ranked_up.head(200)[['names']].to_csv("aneuploid_DEG_upregulated.txt", header=False, index=False)
df_ranked_up
| names | scores | logfoldchanges | pvals | pvals_adj | |
|---|---|---|---|---|---|
| 3954 | LOC101927394 | 4.272121 | 10.703187 | 1.936225e-05 | 5.415886e-05 |
| 3848 | TRIM64C | 4.401271 | 10.439366 | 1.076187e-05 | 3.079865e-05 |
| 1113 | UBTFL1 | 9.708819 | 10.371655 | 2.765208e-22 | 2.417690e-21 |
| 1809 | ZNF705G | 7.877468 | 9.305820 | 3.340825e-15 | 1.877615e-14 |
| 516 | LEUTX | 11.841393 | 9.112036 | 2.384616e-32 | 4.066062e-31 |
| ... | ... | ... | ... | ... | ... |
| 5423 | POLR2E | 2.700232 | 0.015247 | 6.929123e-03 | 1.473686e-02 |
| 3026 | NOP58 | 5.590942 | 0.012913 | 2.258407e-08 | 8.009142e-08 |
| 5205 | EIF5A | 2.925202 | 0.011742 | 3.442323e-03 | 7.593094e-03 |
| 5760 | EIF3IP1 | 2.414706 | 0.010531 | 1.574789e-02 | 3.187478e-02 |
| 4752 | RNPS1 | 3.374563 | 0.006622 | 7.393305e-04 | 1.766872e-03 |
5918 rows × 5 columns
df_ranked_down = df_ranked[(df_ranked['logfoldchanges'] < 0) & (df_ranked['pvals_adj'] < 0.05)] \
.sort_values(by="logfoldchanges")
df_ranked_down.head(200)[['names']].to_csv("aneuploid_DEG_downregulated.txt", header=False, index=False)
df_ranked_down
| names | scores | logfoldchanges | pvals | pvals_adj | |
|---|---|---|---|---|---|
| 21677 | TIMP4 | -8.796244 | -31.847162 | 1.414718e-18 | 9.839324e-18 |
| 21430 | GBGT1 | -8.257070 | -31.374468 | 1.492923e-16 | 9.180910e-16 |
| 19072 | TFF1 | -4.174750 | -29.543272 | 2.983137e-05 | 8.211098e-05 |
| 18776 | CXCR6 | -3.774220 | -29.463697 | 1.605089e-04 | 4.106643e-04 |
| 18612 | IL36RN | -3.543146 | -29.296831 | 3.953844e-04 | 9.709184e-04 |
| ... | ... | ... | ... | ... | ... |
| 3814 | MRPL16 | 4.453687 | -0.016038 | 8.440831e-06 | 2.434491e-05 |
| 5800 | CCNDBP1 | 2.374499 | -0.010735 | 1.757281e-02 | 3.532348e-02 |
| 5427 | EXOSC10 | 2.698875 | -0.006995 | 6.957428e-03 | 1.479173e-02 |
| 5795 | LIMA1 | 2.378665 | -0.003434 | 1.737547e-02 | 3.495655e-02 |
| 5751 | DHX15 | 2.424298 | -0.000406 | 1.533800e-02 | 3.109307e-02 |
6119 rows × 5 columns
Discussion of Results¶
Epigenetic Dysregulation
Due to aneuploidy, we see upregulated terms such as DNA methylation-dependent constitutive heterochromatin formation and methyl-CpG binding, which point toward increased epigenetic activity. This is consistent with previous findings that genomes with chromosome gains and losses tend to be either hypermethylated or hypomethylated, leading to widespread transcriptional dysregulation (Wang et al., 2024). The enrichment of these epigenetic terms suggests that aneuploid cells attempt to compensate for genomic instability through altered chromatin structure and DNA methylation.
Impaired Developmental Signaling & Cell Differentiation
In aneuploid cells, we observe upregulation of several homeobox transcription factors such as DUXA and IPF1, both of which are known to play essential roles in early embryonic development and cell fate specification (Bürglin, 2010). Additionally, Smad2, a key component of the TGF-β signaling pathway that influence cell proliferation and differentiation, is also upregulated.
We hypothesize that this transcriptional dysregulation contributes to the downregulation of key biological processes involved in embryonic development, including:
- Developmental process
- Cell differentiation
- Tissue development
- Blood vessel morphogenesis
- Circulatory system development
Impaired Cell-Cell Communication & Extracellular Signaling
The down regulation of biological process terms such as response to endogenous stimulus, response to chemical and cell surface receptor signalling pathway indicate a broad failure of the aneuploid cells to properly send, receive, and respond to signals, which are essential for coordinating the complex behaviors of a developing embryo.
gp = GProfiler(return_dataframe=True)
up_reg_go = gp.profile(organism='hsapiens', query=list(df_ranked_up.head(200)['names']))
up_reg_go
up_reg_go["label"] = up_reg_go["source"] + ":" + up_reg_go["name"]
up_reg_go["log_p"] = -np.log10(up_reg_go["p_value"])
up_reg_go
up_reg_go_sorted = up_reg_go.sort_values(by=["source", "log_p"], ascending=[True, False])
up_reg_go_sorted
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10, 15))
sns.barplot(
x="log_p",
y="label",
data=up_reg_go_sorted,
estimator=sum,
hue="source",
dodge=False,
width=1);
plt.xlabel("-log10(p-value)")
plt.ylabel("Term")
plt.title("Up Regulated Terms", fontsize=20);
down_reg_go = gp.profile(organism='hsapiens', query=list(df_ranked_down.head(200)['names']))
down_reg_go
down_reg_go["label"] = down_reg_go["source"] + ":" + down_reg_go["name"]
down_reg_go["log_p"] = -np.log10(down_reg_go["p_value"])
down_reg_go
down_reg_go_sorted = down_reg_go.sort_values(by=["source", "log_p"], ascending=[True, False])
down_reg_go_sorted
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10, 40))
sns.barplot(
x="log_p",
y="label",
data=down_reg_go_sorted,
estimator=sum,
hue="source",
dodge=False,
width=1);
plt.xlabel("-log10(p-value)")
plt.ylabel("Term")
plt.title("Down Regulated Terms", fontsize=20);
up_reg_go.to_csv("upregulated_go_terms.csv", sep="\t")
down_reg_go.to_csv("downregulated_go_terms.csv", sep="\t")
adata.write_h5ad('data/cell_fate_2_processed_annotated.h5ad', compression="gzip")
adata = sc.read_h5ad('data/cell_fate_2_processed_annotated.h5ad')
Conclusion¶
In conclusion, aneuploidy profoundly impacts the identity and functional state of early human embryonic cells, leading to the failure of developmental processes. Our analysis of single-cell RNA sequencing data reveals a dysfunctional transcriptional identity and a widespread collapse of functional cellular states. Aneuploid cells lose their proper embryonic identity, as shown by impaired developmental signaling and differentiation. This compromised identity manifests as a severe breakdown in the functional state of the embryo, with the downregulation of terms related to cell differentiation and tissue development.
This loss of functional identity and cellular integrity is likely the reason why aneuploid embryos have low viability